Splitting Noun Compounds via Monolingual and Bilingual Paraphrasing: A Study on Japanese Katakana Words
Abstract
Word boundaries within noun compounds are not marked by white spaces in a number of languages, unlike in English, and it is beneficial for various NLP applications to split such noun compounds. In the case of Japanese, noun compounds made up of katakana words (i.e., transliterated foreign words) are particularly difficult to split, because katakana words are highly productive and are often out-of-vocabulary. To overcome this difficulty, we propose using monolingual and bilingual paraphrases of katakana noun compounds for identifying word boundaries. Experiments demonstrated that splitting accuracy is substantially improved by extracting such paraphrases from unlabeled textual data, the Web in our case, and then using that information for constructing splitting models.
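The abstract does not say which paraphrase forms are used or how the splitting models are built. As a rough illustration only, the sketch below assumes one possible kind of monolingual paraphrase: the same compound written with a katakana middle dot (・) between its parts. It counts how often each delimited form appears in a small corpus and picks the best-attested two-way split. The function names, the toy corpus, and the frequency-based decision rule are all hypothetical and are not taken from the paper.

```python
from collections import Counter

MIDDLE_DOT = "\u30fb"  # katakana middle dot, assumed here to mark a word boundary


def candidate_splits(compound):
    """Return every two-way split of the compound as (left, right) pairs."""
    return [(compound[:i], compound[i:]) for i in range(1, len(compound))]


def count_paraphrases(corpus, compound):
    """Count occurrences of each middle-dot paraphrase of the compound."""
    counts = Counter()
    for left, right in candidate_splits(compound):
        paraphrase = left + MIDDLE_DOT + right
        counts[(left, right)] = sum(text.count(paraphrase) for text in corpus)
    return counts


def split_compound(corpus, compound):
    """Pick the split whose paraphrase is most frequent; keep the word whole if none is seen."""
    counts = count_paraphrases(corpus, compound)
    best, freq = max(counts.items(), key=lambda kv: kv[1], default=(None, 0))
    return list(best) if freq > 0 else [compound]


# Toy corpus (hypothetical): two texts contain the delimited form.
corpus = ["アンチョビ・パスタのレシピ", "アンチョビパスタ", "アンチョビ・パスタ"]
result = split_compound(corpus, "アンチョビパスタ")
print(result)
```

A raw count like this would at most be one signal among several. The abstract describes using the extracted paraphrase information to construct splitting models, not applying a frequency rule directly.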
Similar resources
Extracting French-Japanese Word Pairs from Bilingual Corpora based on Transliteration Rules
It has been shown so far that using transliteration rules to extract Japanese Katakana and English word pairs is highly useful and promising. But for Japanese-French pairs, the method is not guaranteed to work, because only very few Japanese Katakana words are borrowed directly from French. In this paper we will show the possibility of extracting Japanese Katakana and French word pairs based ...
Automatically Harvesting Katakana-English Term Pairs from Search Engine Query Logs
This paper describes a method of extracting katakana words and phrases, along with their English counterparts from non-aligned monolingual web search engine query logs. The method employs a trainable edit distance function to find pairs that have a high probability of being equivalent. These pairs can then be used to further bootstrap training of the edit distance function, ...
Comparing and Extracting Paraphrasing Words with 2-Way Bilingual Dictionaries
We analyze a variety of lexical expressions with 2-way bilingual dictionaries and propose a method for extracting paraphrasing words. First, we compare the coverage between an English-Japanese dictionary and a Japanese-English dictionary from the viewpoint of the returnability of the words by translating English to Japanese, and then back to English again. The variety is shown using examples. N...
Automatic Acquisition of Basic Katakana Lexicon from a Given Corpus
Katakana, a Japanese phonogram mainly used for loanwords, is a troublemaker in Japanese word segmentation. Since Katakana words are heavily domain-dependent and there are many Katakana neologisms, it is almost impossible to construct and maintain a Katakana word dictionary by hand. This paper proposes an automatic segmentation method for Japanese Katakana compounds, which makes it possible to cons...
Automatic Extraction of Translational Japanese-KATAKANA and English Word Pairs
A method to automatically extract translational Japanese-KATAKANA and English word pairs from bilingual corpora is proposed. The method applies all the existing transliteration rules to each mora unit in a KATAKANA word and extracts the English word that matches, or partially matches, one of these transliteration candidates as its translation. For instance, if there is a word ‘グラフ’ (graph) in Japan...